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ABSTRACT Six subspecies are currently recognized in Salmonella enterica. Subspecies I (subspecies enterica) is responsible for 
nearly all infections in humans and warm-blooded animals, while five other subspecies are isolated principally from cold- 
blooded animals. We sequenced 21 phylogenetically diverse strains, including two representatives from each of the previously 
unsequenced five subspecies and 1 1 diverse new strains from S. enterica subspecies enterica, to put this species into an evolu- 
tionary perspective. The phylogeny of the subspecies was partly obscured by abundant recombination events between lineages 
and a relatively short period of time within which subspeciation took place. Nevertheless, a variety of different tree-building 
methods gave congruent evolutionary tree topologies for subspeciation. A total of 285 gene families were identified that were 
recruited into subspecies enterica, and most of these are of unknown function. At least 2,807 gene families were identified in one 
or more of the other subspecies that are not found in subspecies I or Salmonella bongori. Among these gene families were 13 new 
candidate effectors and 7 new candidate fimbrial clusters. A third complete type III secretion system not present in subspecies 
enterica (I) isolates was found in both strains of subspecies salamae (II). Some gene families had complex taxonomies, such as 
the type VI secretion systems, which were recruited from four different lineages in five of six subspecies. Analysis of 
nonsynonymous-to-synonymous substitution rates indicated that the more-recently acquired regions in S. enterica are under- 
going faster fixation rates than the rest of the genome. Recently acquired AT-rich regions, which often encode virulence func- 
tions, are under ongoing selection to maintain their high AT content. 

IMPORTANCE We have sequenced 21 new genomes which encompass the phylogenetic diversity of Salmonella, including strains 
of the previously unsequenced subspecies arizonae, diarizonae, houtenae, salamae, and indica as well as new diverse strains of 
subspecies enterica. We have deduced possible evolutionary paths traversed by this very important zoonotic pathogen and iden- 
tified novel putative virulence factors that are not found in subspecies I. Gene families gained at the time of the evolution of sub- 
species enterica are of particular interest because they include mechanisms by which this subspecies adapted to warm-blooded 
hosts. 
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Salmonella spp. cause about 1.3 billion cases of nontyphoidal 
salmonellosis worldwide each year (1). The economic burden 
due to salmonellosis in the United States alone is estimated to be 
~$2.3 billion annually (2). Salmonella is also a major pathogen of 
domestic animals causing huge economic losses and providing a 
source of infection for humans (3). Sero typing-based identifica- 
tion of somatic (O) and flagellar (H) antigens was among the first 
methods used for taxonomic classification of Salmonella, and each 
serovar was initially considered a different species. However, cell 
surface antigens are sometimes horizontally transferred, a phe- 
nomenon that can cause classification of genetically unrelated 
strains within the same serovar (4, 5) . Salmonella taxonomy took a 
major stride when Falkow and colleagues (6) used DNA hybrid- 
ization to demonstrate that all tested serovars were related at the 
species level and identified five distinct subgenera within the spe- 
cies. Salmonella is now considered to consist of two species, Sal- 
monella bongori and Salmonella enterica, and S. enterica is further 



classified into six subspecies, arizonae (Ilia), diarizonae (Mb), 
houtenae (IV), salamae (II), indica (VI), and enterica (I) (7). S. en- 
terica subsp. enterica (I) strains represent the vast majority of Sal- 
monella strains isolated from humans and warm-blooded ani- 
mals, while all the other subspecies and S. bongori are more 
typically (though not exclusively) isolated from cold-blooded an- 
imals (8). Approximately 50 of the nearly 2,600 known Salmonella 
serovars account for -99% of all clinical isolates of Salmonella 
from humans and domestic mammals (9), and all of these 50 
serovars are in subspecies I. 

Genome sequencing efforts in S. enterica have so far focused on 
the most prevalent serovars of subspecies enterica (I). We have 
sequenced the genomes of eleven additional members of subspe- 
cies enterica (I), selected based on their diversity, and those of two 
different serovars from each of the other five known subspecies. 
We compared these sequences to the whole-genome sequences of 
S. bongori (10) and to seven previously sequenced subspecies en- 
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terica (I) strains. We construct phylogenetic hypotheses for these 
29 genomes while taking into account the high rate of recombina- 
tion among Salmonella strains (11-15). Because acquisition and 
loss of genes is a major force driving the evolution of virulence in 
Salmonella (16), we modeled the gain and loss of gene families at 
each ancestral node. We reconstructed hypothetical gene contents 
of the most recent common ancestor (MRCA) of each subspecies. 
Gene families gained at the subspecies enterica (I) node may pro- 
vide clues to the strategies and virulence factors that contributed 
to the formation of a lineage which has evolved to infect princi- 
pally warm-blooded hosts. We also identified gene families under- 
going accelerated evolution based on pairwise synonymous-to- 
nonsynonymous single nucleotide polymorphism (SNP) ratios. 
This group of gene families is particularly interesting because 
some of this selection may be driven by newly acquired life strat- 
egies or by interactions of the bacterium with the host. 

RESULTS AND DISCUSSION 

We sequenced 21 new Salmonella genomes, 10 of which were se- 
quenced to completion while the remaining genomes were se- 
quenced to obtain improved high-quality drafts. The strains and 
sequencing statistics are summarized in Table SI in the supple- 
mental material along with information regarding other previ- 
ously published strains and species to which these new genomes 
were compared. 

Among the 21 new genomes, two strains were selected from 
each of the five previously unsequenced subspecies. In addition, 
1 1 genomes were selected from 305 strains within subspecies en- 
terica (I) that lacked many genes found in S. enterica subsp. en- 
terica serovars Typhimurium LT2 and Typhi CT18 as well as 
strains representing distant genomovars within a single serovar 
based on comparative genomic hybridization (17-19) (data acces- 
sible at https://dl.dropbox.eom/u/99836585/MMCC_all_CGH 
_100407.xlsx). 

The fact that some of the genomes have not been sequenced to 
completion means that some sequencing errors, misassemblies, 
annotation errors due to collapsing of duplicate genes, and dupli- 
cated annotations of genes that span contig boundaries still exist 
in a few locations in a few genomes in this data set. The analyses we 
perform below are designed to mitigate but not eliminate these 
limitations. The numbers presented are all estimates constrained 
by these caveats, and some analyses, such as strict tests of orthol- 
ogy and studies of gene duplication, are not possible on our draft 
genomes. Nevertheless, the obtained high-quality drafts permit a 
fascinating insight into evolutionary processes during Salmonella 
subspeciation. 

Phylogenetic analysis. We used three different approaches to 
predict phylogenetic relationships between orthologous "core" 
regions shared by all subspecies of Salmonella. We first used 
Mauve (20) to align the 29 Salmonella genomes included in our 
study and used Escherichia coli K-12 as an outgroup. We identified 
737,062 SNPs in the "core" regions of -2.6 Mb that were present 
and aligned with high confidence in all taxa. We used these data to 
construct a bootstrapped maximum likelihood (ML) tree using 
PvAxML (version 7.2.6) (21). Figure 1A shows the cladogram con- 
structed using this approach. The relationship between the sub- 
species was supported in all 1,000 bootstrap replicates using ran- 
dom samples of 50% of the SNP data. 

To estimate the divergence times of each subspecies, codon 
alignments were constructed for 2,025 genes present across all 30 



genomes in single copies, as annotated by automated RAST (22) 
or annotated in the publically available genomes. A total of 
348,642 synonymous SNPs, which did not change the amino acid 
sequence, were identified. Figure IB shows a condensed and lin- 
earized version of the tree built using these SNPs (with 1 ,000 boot- 
straps on 50% of the data), calibrated based on a previously esti- 
mated 140-million-year divergence time between Salmonella and 
E. coli (23). The topology of this tree (Fig. IB) was in agreement 
with the tree of all "core" SNPs (Fig. 1A). 

Using synonymous SNPs, it was estimated that subspecies en- 
terica (I) diverged from its last common ancestor -27 mil- 
lion years ago and that the most recent common ancestor 
(MRCA) of all analyzed subspecies enterica strains arose -12 mil- 
lion years ago. This estimate supports the possibility that the sub- 
species evolved long after their respective preferred hosts. 

As a distinct alternative strategy to determine phylogeny, we 
built individual maximum likelihood (ML) trees for each of the 
2,025 core genes. DNA sequences for each potentially orthologous 
gene family were aligned using Muscle (version 3.8.1) (24), and 
ML trees were constructed using RAxML (version 7.2.6) with 
1,000 bootstraps on 50% of the data (21). A consensus tree based 
on these 2,025 trees was generated in Phylip (version 3.69) (25) 
using the extended majority consensus rule (Fig. 1C). The num- 
bers on the internal nodes indicate the fractions of gene trees 
which support the partition of the taxa at that node. The low 
numbers of compatible trees at each node indicate a high level of 
incongruence between individual gene trees and the consensus 
tree. Nevertheless, the consensus tree for subspeciation displayed 
the same topology as the SNP trees in Fig. 1A and B. There was 
generally more congruence at each terminal subspecies node than 
at ancestral nodes. This may either reflect that the subspecies are 
acting like incipient species, with a barrier to intraspecies recom- 
bination, or indicate that there was insufficient time in which re- 
combination could have taken place after the subspecies arose, or 
both. 

Two mechanisms that can introduce phylogenetic discordance 
between orthologous genes in gene trees and species trees are in- 
tergenomic recombination (gene conversion) and incomplete lin- 
eage sorting (ILS) (26). Given the relatively short branch lengths 
within which the subspecies arose (Fig. IB), incomplete lineage 
sorting as a source of gene tree discordance is possible. However, 
gene conversion events can reintroduce previously lost allelic 
changes into recombining populations. It is not possible to distin- 
guish between these two mechanisms using data from contempo- 
rary Salmonella strains. 

Salmonella genomes undergo frequent intergenomic recombi- 
nations (11-15) and thereby disrupt the clonal inheritance pattern 
within the recombined regions. Clonal Frame (27) takes recom- 
bination into account during inference of phylogenetic relation- 
ships. Clonal Frame also attempts to reconstruct the mutation and 
recombination events that give rise to a particular lineage on the 
branches of the phylogenetic tree. The program is computation- 
ally intensive; therefore, it was necessary to use only a subset of all 
available data. Core genome regions of greater than 10 kb that 
were shared by all 30 taxa were extracted from Mauve-based 
whole-genome alignments. This resulted in extraction of 8 blocks 
whose total alignment size was 104,029 bp, representing -4% of 
the core genome, which was used as an input for Clonal Frame. 
Two replicate runs of 200,000 Markov chain Monte Carlo 
(MCMC) iterations were performed. The tree topology for the 
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FIG 1 Evolution of Salmonella subspecies, as revealed by different phylogenetic tree-building algorithms. (A) Maximum likelihood cladogram. An alignment 
of an ~2.6-Mb core sequence conserved across all genomes (737,062 SNPs) was used, and the presented cladogram was constructed using RAxML version 2.7.6 
(21). Internal nodes show bootstrap support values from 1,000 replicates. (B) Condensed and linearized maximum likelihood phylogram. Exactly 348,642 
synonymous substitutions in 2,025 core genes present across all genomes in single copies were used. Distances were estimated using RAxML version 2.7.6 with 
1,000 bootstraps (21), and the temporal calibration is based on the 140-million-year divergence time between E. coli and Salmonella reported previously (23). (C) 
Majority-rule consensus cladogram of 2,025 core phylogenetic trees constructed using Phylip version 3.69 (25). Internal nodes indicate the fractions of gene trees 
which support the partition of the multiple taxa at that node. (D) Condensed phylogram obtained from a Clonal Frame (27) analysis of concatenated 104,029-bp 
alignment. The tree represents a consensus of two replicate analyses of 200,000 Markov chain Monte Carlo iterations. 



subspecies obtained from Clonal Frame (Fig. ID) predicted a 
branching order of the subspecies similar to that of the other tree- 
building methods and further suggested that, within subspecies 
enterica (I), the previously defined clade B is a separate ancient 
lineage (13, 28, 29). Five of the subspecies enterica (I) genomes we 
sequenced proved to be new members of clade B. No additional 
consensus was achieved for the branching order within the strains 
we used in subspecies enterica (I). 

Recombination. The importance of recombination in the evo- 
lution of S. enterica is well established (11-15). Clonal Frame anal- 
ysis suggested that recombination played an equally important 
role as mutation in the evolution of the subspecies 
(recombination-to-mutation ratio [r/m] = 0.94). We used the 
Recombination Detection Program (RDP) (30) to determine the 
most likely regions of recombination in each genome. This tool 
suggested hundreds of such events per genome (data not shown). 
As an illustration, we annotated the potential gene conversion 
events in just one strain, the archetype strain for S. enterica subsp. 



enterica (I) serovar Typhimurium — strain LT2 (31). We anno- 
tated only the recombination events in which the potential source 
of the sequence in LT2 was not within subspecies enterica (I). This 
information is presented in Table S2 in the supplemental material. 
In brief, approximately 120 conversion events encompassing 
around 347 kb (336 genes) of S. Typhimurium LT2 sequence were 
predicted to have a source outside subspecies enterica (I). 

Gain and loss of genes during the evolution of Salmonella. 
Gain and loss of genes is one of the major mechanisms that drives 
diversification within bacteria (32). Homologous gene families 
among the 29 Salmonella genomes were identified using Fas- 
tOrtho (33), a reimplementation of OrthoMCL (34), and is avail- 
able at http://enews.patricbrc.org/fastortho/. For the 29 Salmo- 
nella genomes covering the known phylogenetic diversity of 
Salmonella, we identified 1 1,443 FastOrtho gene families (includ- 
ing orthologs and close paralogs) in the pangenome, of which 
3,22 1 gene families were conserved across all analyzed strains. Fig- 
ure 2 shows the rate of identification of new gene families observed 



March/April 2013 Volume 4 Issue 2 e00579-12 



Bio' mbio.asm.org 3 



Desai et al. 




• Pan-genome 
-■- Strain specific genes 



A A ,> A A A 



<y V <V V IT <P <V? <V V V 




cf- A c® .<& 



&\<F.4°.&'.<$'. 



FIG 2 Gene accumulation enumerations and rarefaction across 29 Salmonella strains. (A) Rarefaction 
curves were estimated by bootstrapping 100 permutations of randomized sample order. Error bars 
indicate the bootstrap standard deviation (SD) based on variation in sample order among randomiza- 
tions. Rarefaction curves were calculated using Estimates (35). See the text for details. (B) Gene accu- 
mulation curve calculated based on the presence/absence gene profile. Strains were ranked in ascending 
distance from the Typhimurium LT2 reference genome. Core and pangenomes were estimated after 
addition of one strain at a time. 



with each new sequence, based on a rarefaction curve that was 
calculated using Estimates (35), where the number of gene fami- 
lies equals 4,922.9 X (number of genomes) 02492 and R 2 = 0.9985. 
This approximation, which is highly reliant on the level of diver- 
sity already sampled, estimates that about 100 ± 63 (mean ± SD) 
new families would be observed in the next additional Salmonella 
genome sequence. 

Some of the genomes we analyzed are incomplete, leading to 
some annotation inconsistencies, especially at contig boundaries 
and in multicopy genes. To reduce the rate of false negatives (i.e., 
calling a gene absent because it was not annotated) , we augmented 
the phyletic pattern with a tblatx analysis as detailed in Materials 
and Methods. The augmented phyletic pattern for all identified 
gene families (see Table S2 in the supplemental material) was cre- 
ated based on the presence and absence of a homologous gene 
across all strains. The DNA sequence of an exemplar for each 
family is included in Table S2, and Table S3 contains the locus tags 
of all annotated genes that belong to each gene family. These data 
were used to model gain and loss of gene families at all nodes of the 
phylogenetic hypothesis (Fig. 1 A) by computing posterior proba- 
bilities for gain and loss of gene families at inner nodes using 
Count (36). Gitools (37) was then used to estimate statistical over- 



representation of gene families gained 
and lost at each ancestral node within 
specific gene sets. Figure 3 shows the gain 
and loss of specific gene sets that were sta- 
tistically enriched (false-discovery rate 
[FDR], <0.1) at major ancestral nodes. 
Figure 4 summarizes the presence/ab- 
sence profile of important gene sets 
across the 30 strains analyzed. A complete 
list of all predicted gene family gains and 
losses can be found in Table S4. Using 
Count (36), 3,379 gene families within 
the observed pangenome were predicted 
to have been gained more than once (ho- 
moplasies) across the evolutionary tree. 
Hence, at least -30% of the pangenome 
appears to be shared within the genus by 
multiple horizontal transfers. Gene sets 
enriched for harboring homoplastic gene 
families are indicated in Fig. 4. This list is 
dominated by virulence factors (SPI-7, 
-8, -10, -12, -13, -14, and -19 and the ye- 
rsiniabactin cluster) and includes eleven 
fimbrial clusters and six uncharacterized 
islands. Metabolic traits enriched for ho- 
moplastic gene families include allantoin 
utilization and inositol catabolism. 

Due to some fragmented genes being 
annotated twice at contig boundaries and 
the collapse of some regions of known 
duplication in the incomplete genomes, 
we refrained from modeling gene dupli- 
cations onto the phylogenetic tree. How- 
ever, because fimbrial gene clusters are of 
general interest due to their many paral- 
ogs and their suspected role in pathoad- 
aptation, we manually curated the phyl- 
etic pattern of genes in fimbrial gene 
families, taking into account genomic context whenever possible 
(see Table S5 in the supplemental material). For all other genes, 
the individual gene trees provided in Table S2 can be used to 
resolve the paralogs on a gene-by-gene basis. 

Gene families gained by the most recent common ancestor of 
all Salmonella subspecies. Count (36) predicted that the most 
recent common ancestor (MRCA) of all Salmonella subspecies, 
gained an estimated 657 gene families (including orthologs and 
close paralogs) while diverging from a common ancestor with 
Escherichia. A total of 454 of these gene families are present in all 
29 analyzed Salmonella genomes. A total of 347 out of the 657 
genes were estimated to be gained in 90 islands of two or more 
genes. Important acquisitions from a virulence perspective in- 
cluded SPT1 (encoding a type III secretion system [T3SS] and 
effectors required for invasion of eukaryotic cells [38]), SPI-4 (a 
type I secretion system for toxin delivery [38] ), five genes in SPI-5 
(T3SS effectors [38]), the tetrathionate respiration cluster (39), 
and the Bcf fimbrial cluster required for colonization of Peyer's 
patches (40). Other genes gained included 9 glycosyl hydrolases, 
41 genes with cyclic AMP receptor protein (CRP) binding sites 
(genes involved in carbohydrate catabolism under catabolite re- 
pression), and 9 genes (three operons) coding for anaerobic re- 
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FIG 3 Gene sets gained and lost at major ancestral nodes. Only statistically 
overrepresented gene sets (FDR < 10%) are shown. Subspecies enterica (I) is to 
the left of the vertical red line. Each cell is colored based on the ratio of genes 
gained/lost to the total number of genes in the gene set. See Table S7 in the 
supplemental material for identification of genes belonging to each gene set. 



ductases (other than the tetrathionate reductase), which enable 
use of alternate electron acceptors. Hence, after diverging from 
E. coli, the most recent common ancestor of all Salmonella subspe- 
cies appears to have gained genes that enable utilization of differ- 
ent carbohydrates under anaerobic conditions. A total of 224 hy- 
pothetical proteins with no currently attributed function were also 
gained at this node, of which 145 were conserved in all 29 Salmo- 
nella strains analyzed. The latter class is a particularly intriguing 
subgroup for further study. 

Genes gained and lost by the hypothetical S. bongori ances- 
tor. Using an assumption of maximum parsimony, S. bongori re- 
tained all ancestral genes gained by the MRCA of Salmonella but 
lost 63 genes ancestral to the MRCA of Salmonella and Escherichia. 
These losses included islands required for ethanolamine utiliza- 
tion (STM2454-STM2470), fucose catabolism (STM2974- 
STM2979), and nitrate and nitrite ammonification (STM4277- 
STM4280). Hence, S. bongori either failed to gain or lost specific 
carbon utilization and anaerobic respiratory pathways that may 
contribute to survival in an inflamed gut (41). This species also 
gained five fimbrial clusters. S. bongori gained a phylogenetically 
distinct type VI secretion system (54736. 3.peg. 1295- 
54736.3.peg.l308) which seems to remain unique to S. bongori 
(10). In addition to the T3SS effectors gained by the MRCA of 
Salmonella, S. bongori gained seven additional effectors, all of 
which were apparently gained in multiple other lineages (ho- 
moplasies). These data are illustrated in Table S6 in the supple- 
mental material, which lists the absence/presence status of all po- 
tential effector gene families identified in Salmonella. 

Gene families gained by the most recent common ancestor of 
the species S. enterica. Under the assumption of maximum par- 
simony, when S. enterica separated from the MRCA of Salmonella, 
it gained 213 gene families (including orthologs and close paral- 
ogs), of which about 120 gene families were gained in 15 islands of 
two or more genes, for a total of approximately 108 insertion 
events. Gene families gained in islands of three or more included 
SPI-2 (STM1391-STM1422), which triggers colitis in the warm- 
blooded Salmonella host (42) and enables survival and replication 
in macrophages (42, 43), a vitamin B 12 synthesis and propanediol 
utilization cluster (STM2019-STM2058) (44), a salmochelin- 
mediated iron acquisition cluster (STM2773-STM2777) that re- 
sults in salmochelin-mediated lipocalin resistance in an inflamed 
environment (45), a putative sialic acid metabolism cluster 
(STM1127-STM1131), and three more islands (Fig. 3) encoding 
putative metabolic clusters. Although the putative sialic acid me- 
tabolism cluster has not been associated with virulence, its gain 
might also provide a metabolic advantage in the usually sialic acid- 
rich environment of the gut (46). A total of 71 hypothetical pro- 
teins were also gained at the node, of which 38 were conserved 
across all 28 S. enterica strains. The latter genes may define new 
capabilities that are preserved in all S. enterica strains. 

Gene families gained and lost by the MRCA of S. enterica 
subsp. enterica (I). Among all the subspecies, strains in subspecies 
enterica (I) cause the vast majority of infections in warm-blooded 
animals. Hence, genes gained at this ancestral node might be im- 
portant for the evolution of virulence in mammalian and bird 
hosts. A total of 285 gene families (including orthologs and close 
paralogs) were modeled as being gained by the MRCA of subspe- 
cies enterica (I), of which 167 were unique to that subspecies while 
118 genes may have been acquired by multiple lineages through 
horizontal transfer (homoplasies) (see Table S4 in the supplemen- 
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FIG 4 Phyletic distribution of major groups of genes. Gene classes that include novel genes are shown in red. The number of genes in a gene set and the 
proportion of homoplastic genes within each gene set are shown on the right. The FDR (<10%) for homoplasy enrichment is in green. Serovars of subspecies 
enterica fl) are shown to the left of the vertical red line. Each cell is shaded based on the ratio of genes present, compared to the total number of genes in the gene 
set. See Table S7 in the supplemental material for identification of genes belonging to each gene set. 
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tal material). A total of 185 gene families were gained in 47 islands 
of two or more genes; hence, subspecies enterica (I) gained 285 
gene families in about 147 events. 

The majority of genes gained by the common ancestor of sub- 
species enterica (I) had little or no homology to known functions. 
Among identifiable genes, notable gains without any homoplasies 
included SPI-6 (a type VI secretion system and Saf fimbriae), a 
2-aminoethylphosphonate metabolism cluster (STM0426- 
STM0432), two islands (STM3779-STM3785 and STM4065- 
STM4080) needed for unknown carbohydrate utilization, and the 
autoinducer 2 (AI-2) transport and processing (STM4072- 
STM4079) cluster. No subspecies enterica (I)-specific effectors 
were discovered in this analysis, indicating that the acquisition of 
effectors may not have been part of the driving force in the foun- 
dation of this subspecies. 

The role of SPT6 in macrophage survival and cell invasion 
has been documented previously (44, 47), while the 
2-aminoethylphosphonate cluster has been reported to be essen- 
tial for long-term persistence in mice (48). STM3779-STM3785 is 
an AT-rich island and encodes a complete phosphotransferase 
system (PTS) with unknown specificity, along with a transcrip- 
tional regulator and a fructose-bisphosphate aldolase. This island 
was reported to be under histone-like nucleoid structuring pro- 
tein (H-NS) -mediated repression (49). An analysis of all publi- 
cally available expression data for S. Typhimurium using Colom- 
bos (50) suggested that the expression of this island was induced 
upon treatment with H 2 0 2 (51). Hence, this island might be im- 
portant during intracellular survival of Salmonella. 

STM4065-STM4071 codes for a permease, a transcriptional 
regulator, and metabolic enzymes for utilization of unknown car- 
bohydrates. Colombos analysis revealed that expression of this 
cluster was induced during swimming motility compared to 
swarming motility in S. Typhimurium (Gene Expression Omni- 
bus data set GSE1633 at http://www.ncbi.nlm.nih.gov/geo/). A 
recent screen of a complete single-gene deletion library in mice 
revealed that four genes in this island were of functional impor- 
tance for S. Typhimurium survival in the liver of mice (M. Mc- 
Clelland, unpublished data). 

The role of AI-2 in Salmonella signaling and metabolism is not 
clear (52). The only known regulon under direct AI-2-mediated 
transcriptional control codes for active uptake and processing of 
AI-2 (53) . Colombos analysis suggested that the AI-2 loci might be 
under orcA-mediated negative regulation in anaerobic conditions 
(54). LuxS-mediated production of AI-2 is conserved across all 
Salmonella spp., but the mechanism to sense AI-2 is present only 
in subspecies enterica (I) and absent in all other subspecies as well 
as in S. bongori. However, orthologs for AI-2 transport and pro- 
cessing are present in many other Enterobacteriaceae. 

Gene families specific to other subspecies. A total of 2,807 
gene families, which were found in one or more of the other five 
S. enterica subspecies but not present in any member of S. enterica 
subsp. enterica (I) or S. bongori, were identified. The majority of 
these gene families had little or no homology to known functions. 
A comprehensive list can be easily extracted from data presented 
in Table S2 in the supplemental material. Novel gene clusters en- 
coding putative virulence functions include the following: 

• SPT20 (55), which encodes a type VI secretion system pres- 
ent only in subsp. arizonae (Ilia) 

• SPT21 (55), which encodes a type VI secretion system pres- 



ent only in subsp. arizonae (Ilia) and subsp. diarizonae 
(Illb) 

• The locus of enterocyte effacement (LEE) (56), which — 
within Salmonella — is specific to subsp. salamae (II). The 
locus harbors a complete T3SS and is homologous and syn- 
tenic to the LEE in E. coli 0157:H7. However, homologs of 
only 5 of the 24 known effectors translocated through this 
machinery in E. coli 0157:H7 (57, 58), namely, fir, nleD, 
cseAB, cesD2, and cesT, were present in subsp. salamae. 

• Seven new fimbrial operons (named sib, sic, sas, sdf, sbs, sti, 
and ssf) (Fig. 4) 

• A novel family of cytolethal distending toxins (523839.5 
.peg.1805, 523839.5.peg.l806, 523839.5.peg.l807) which was 
specific to subsp. arizonae (Ilia) and subsp. diarizonae (Illb). 
This cluster codes for all three subunits of a functional multi- 
meric toxin. The cluster is homologous to the cdt cluster found 
mE. coli (59). 

• A total of 13 new putative effector families (see Table S6 in 
the supplemental material). Eight of these were specific to 
subsp. salamae (II), five of which were part of the LEE. One 
of these eight effectors (523838. 4.peg.2994) may encode a 
chimeric protein with parts similar to EspH from E. coli 
O103:H25 and SopA from subsp. enterica (I). 

Gene families with higher evolutionary rates. Different parts 
of the genomes diverge at different rates under different evolu- 
tionary pressures. To identify proteins evolving at a higher rate 
than the rest of the pangenome, pairwise ratios of synonymous 
substitutions per synonymous sites to nonsynonymous substitu- 
tion per nonsynonymous sites (dS/dN) were calculated for all pro- 
tein coding genes within each gene family for all pairs of taxa. The 
pairwise ratios within each gene family were averaged to assign a 
mean evolutionary rate to each family. Using this approach, we 
observed that recent protein coding gene acquisitions evolve faster 
than ancestral genes after normalizing for the rate of neutral di- 
vergence. Figure 5 shows the distribution of mean log 10 values 
(dS/dN) for all nonhomoplastic genes that were recruited into the 
genomes at different times. There was a statistically significant 
difference (P < 0.01) in the evolutionary rate of genes gained at 
different evolutionary time points. Younger proteins evolved 
faster than ancient proteins. We also calculated the averages of 
log 10 values (dS/dN) within only subspecies enterica (I), in which 
this same trend was also observed (data not shown). Genes ac- 
quired through horizontal gene transfer (HGT) have previously 
been observed to evolve faster than core genes in E. coli (60). 

An enrichment analysis of the pangenome ranked by mean 
log 10 dS/dN values was performed to estimate the overabundance 
of fast-evolving proteins within gene sets grouped by a variety of 
criteria. This analysis (Table 1) suggested a higher proportion of 
fast-evolving proteins among genes with abnormal GC content, 
genes encoding plasmid-related functions (type IV secretion, 
toxin-antitoxin systems), and genes important for the interaction 
of Salmonella with the host. Among the known SPIs, SPI-2, SPI-6, 
SPI-7, SPI-8, SPI-10, and SPI-12 were also enriched for fast- 
evolving proteins. 

Evolution of differences in GC content across the genome. 
The ancestral orthologous regions shared by Salmonella and E. coli 
have evolved to contain a different GC content, with Salmonella 
genomes being relatively more GC rich (see Fig. SI in the supple - 
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FIG 5 Mean log 10 values (dS/dN) for genes recruited into the genomes at 
different time points. The time at which the genes were recruited into their 
respective genomes was estimated based on the tree in Fig. IB. 



TABLE 1 Statistically enriched gene sets containing a 
disproportionately high number of genes evolving faster than the rest of 
the pangenome 
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mental material). In contrast, laterally transferred islands ac- 
quired by Salmonella since divergence from E. coli, which often 
contain virulence functions, are usually very AT rich compared to 
the rest of the genome (49). Genetic acquisitions within the Sal- 
monella pangenome after the speciation of Salmonella from E. coli 
are enriched for AT-rich regions (Fig. S2). An H-NS-mediated 
regulatory mechanism controls the transcription of AT-rich re- 
gions (49). Hence, newly acquired AT regions might need to re- 
main AT rich or become even more AT rich if they were to rely on 
this regulatory mechanism. Table 2 shows the maximum compos- 
ite likelihood (M CL) estimate of the pattern of nucleotide substi- 
tution of three sets of genes; the concatenated alignment of 2,025 
genes shared by all the Salmonella and E. coli strains (the core), the 
AT-rich SPI-1, and the similarly AT-rich SPT4. The core has 
evolved from a lower GC to maintain a higher GC in Salmonella, as 
evidenced by a higher AT-to-GC substitution rate than GC-to-AT 
substitution rate (46.6% versus 39.1%). 
In contrast, the two AT-rich islands 
(SPI-1, SPI-4) have a substitution ratio 
that maintains them at their current level 
of AT richness (37.8% versus 47.2% for 
SPI-1 and 33.3% versus 49.2% for SPI-4), 
consistent with selective pressure to 
maintain regulation by the silencer of AT- 
rich regions, H-NS. 

We investigated SPI-1 T3SS homologs 
in a wide variety of species. Enterobacteri- 
aceae, such as Escherichia, Shigella, Citro- 
bacter, and Erwinia, are closely related to 
Salmonella, and the T3SS homologs are 
similarly AT rich. However, in certain 
other high-GC organisms, such as Pseu- 
domonas, Burkholderia, Xanthomonas, 
Chromobacterium, and Bordetella, xe- 
nologs for the T3SS have GC contents 
similar to that of their host genome (data 



not shown). Hence, even though H-NS homologs are widely dis- 
tributed (61) and present in all the above-mentioned organisms, 
different groups of bacteria seem to tolerate foreign genes with 
distinct GC content via different mechanisms. 

Conclusion. We predicted the evolutionary descent of each 
Salmonella enterica subspecies using three distinct approaches. All 
three methods were congruent in predicting the subspeciation 
topology. Highly incongruent individual gene trees suggested sig- 
nificant ancestral recombination during the evolution of the sub- 
species. We also identified lineage-specific gene clusters and genes 
undergoing accelerated evolution. 

Each of the subspecies has a distinct repertoire of genes. 
Some convergence is seen in independently acquired, phyloge- 
netically distinct but functionally similar gene clusters like 
T6SSs, a few functionally redundant T3SS effectors, fimbrial 



TABLE 2 AT-to-GC and GC-to-AT nucleotide substitutions in 29 Salmonella genomes, 
compared to the predicted state of the common ancestor" 
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a Three sets are shown: (i) the concatenated core codon alignments, (ii) the AT-rich SPI-1, and {iii) the AT-rich 
SPI-4. Maximum composite likelihood estimates were compared for 2,025 single-copy genes, the AT-rich SPI-1 
(3009898 to 3046643 in LT2}, and SPI-4 (4477857 to 4501818 in LT2). The observed GC is the average GC content 
of all sequences compared. % equilibrium GC equals the rate of AT-to-GC substitutions divided by the sum of the 
rate of AT-to-GC substitutions and the rate of GC-to-AT substitutions. Gaps and missing data were excluded from 
the analysis. 
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clusters, iron acquisition systems, cytolethal distending toxins, 
and anaerobic respiratory reductases. 

Genes recruited at the subspecies enterica (I) node may ulti- 
mately give clues to the adaptation of this lineage into warm- 
blooded animals. However, of 167 nonhomoplastic genes ac- 
quired at the subspecies enterica (I) node, 61 are hypothetical 
proteins and 93 have poorly defined putative functions. Func- 
tional characterization of these genes may reveal new mechanistic 
details of how subspecies enterica (I) adapted to warm-blooded 
hosts. 

MATERIALS AND METHODS 

Growth conditions and DNA isolation. Bacterial strains were grown un- 
der standard conditions, at 37°C in Luria broth, with aeration. Genomic 
DNA for sequencing was prepared from stationary cultures using the 
GenElute genomic DNA isolation kit for bacteria (Sigma) by closely fol- 
lowing the manufacturer's recommendations. 

Genome sequencing and assembly. Genomes were sequenced and 
assembled at The Genome Institute (Washington University). Sequencing 
technology used and coverage obtained for each genome are shown in 
Supplementary Table SI. The sequencing reads for S. enterica serovar 
Paratyphi B BAA 1250, S. enterica subsp. diarizonae BAA639, and S. en- 
terica serovar Enteritidis BAA1734 were assembled using the PCAP as- 
sembly program (62), while the rest of the genomes were assembled using 
the Newbler assembly program (454 Life Sciences, CT). All assemblies 
were run through a contamination screening process, which involves gen- 
eration of %GC plots based on the assembly consensus, blasting the as- 
sembly consensus against the nucleotide database (NCBI) and ribosomal 
DNA database (63), identification and removal of contaminating se- 
quences, and regeneration of new assembly output files and statistics after 
contamination removal. The finishing phase, which includes closure of 
gaps, resolution of ambiguous bases, and correction of misassembled re- 
gions, was completed for the assemblies of Salmonella enterica subsp. in- 
dica BAA1576, Salmonella enterica subsp. houtenae BAA1580, S. Enteriti- 
dis BAA1587, S. Paratyphi B BAA 1250, S. enterica subsp. diarizonae 
BAA639, and S. enterica subsp. arizonae BAA73 1 . The remaining genome 
sequences were improved to high-quality drafts by automated and man- 
ual review and editing of the sequence data and orientation of contigs. 

Genome annotation and pangenome analysis. For each genome, 
contigs from the de novo assemblies were ordered and concatenated into a 
pseudocontig, using the S. Typhimurium LT2 genome as the reference. 
Mummer (64) was used to align and order the contigs to the reference, 
and the alignment was parsed using custom Perl scripts. The ordered 
contigs were concatenated by inserting 50 N between contig breaks. These 
sequences were submitted to RAST (22) for automated annotation. Sieve 
(65) was used for de novo prediction of putative T3SS effectors. Putative 
effectors predicted using SIEVE were also analyzed using BPBAac (66), 
T3MM (58), and Effective T3 (67). Genes either annotated as effectors by 
RAST based on homology to known effector families or predicted to be 
effectors by at least 2 of the 4 methods tested were considered to be puta- 
tive T3SS effectors. 

Orthologous and paralogous gene families were initially constructed 
based on RAST annotations. An "all-against-all" tblatx matrix was con- 
structed using BLAT (68). The blat matrix was used as an input to con- 
struct orthologous/paralogous gene families using FastOrtho (http: 
//enews.patricbrc.org/fastortho/), which is a faster reimplementation of 
OrthoMCL (34). A representative nucleotide sequence from each gene 
family was collated with strain-specific genes not present in any gene 
families. This set of representative sequences was again subjected to a 
tblatx analysis against the whole-genome nucleotide sequences to identify 
candidate genes that may have been missed by RAST. A gene and/or close 
paralog was considered "present" if it had at least 90% nucleotide se- 
quence identity covering at least 20% of the sequence length. The phyletic 
table generated from the tblatx analysis was consolidated with the phyletic 
table generated from the OrthoMCL analysis to compute a comprehen- 



sive pangenome matrix. Rarefaction curves for pangenome size estimates 
were calculated based on the phyletic pattern, using Estimates (version 
8.2.0) (35). 

A representative nucleotide sequence from each gene family was used 
for annotation augmentation using a variety of different methods. SignalP 
(69) was used to annotate sequences for presence of signal peptides, while 
TMHMM (70) was used to predict presence of transmembrane helices. 
Representative sequences from individual gene families were also anno- 
tated using KAAS (71) to overlay KEGG pathways onto the pangenome. 
Gene sets for metabolic pathways were constructed based on the KEGG 
and SEED annotations (22) from RAST. Other gene sets were constructed 
based on Virulence Factor Database (VFDB) annotations (72) and man- 
ual curation. Membership of each gene family within individual gene sets 
is listed in Table S7 in the supplemental material. 

Gain/loss analysis. The pangenome matrix was used as an input for 
Count (36) to calculate posterior probabilities for gain and loss of each 
gene family across all nodes of the phylogenetic hypothesis. The posterior 
probability matrix was converted into a binary matrix using a threshold of 
0.75. This binary matrix was used to estimate overabundance of gain or 
loss of gene families within specific gene sets across major ancestral nodes 
using Gitools (37). Within Gitools, the Fisher exact test was used to esti- 
mate the significance of enrichment of specific gene sets, and false- 
discovery rates (FDR) were used to adjust the P values for multiple com- 
parisons. 

Any gene family gained more than once across the phylogenetic tree 
was marked as a homoplastic gene family. This binary information was 
used as an input with Gitools (37) to assess enrichment of homoplastic 
genes within specific gene sets. 

Evolutionary rate of gene families. Individual gene sequences (DNA 
and amino acid sequences) for all genes in gene families predicted by 
FastOrtho were extracted using custom Perl scripts. Codon alignments 
within gene families were constructed using PAL2NAL (73). Amino acid 
alignments needed as input for PAL2NAL were constructed using Muscle 
(24). The dS/dN ratios for all possible pairwise comparisons within a gene 
family were calculated based on the codon alignments using SNAP (74). 
Mean dS/dN ratios were assigned for individual gene families by averaging 
all pairwise ratios within each family. The pangenome was ranked based 
on log 10 dS/dN ratios, and enrichment analysis for gene sets containing 
relatively fast-evolving gene families was conducted using gene set enrich- 
ment analysis (GSEA) (75). Within GSEA, false-discovery rates were used 
to adjust P values for multiple comparisons. 

Phylogenetic analysis. Whole-genome alignments for all 30 strains 
were constructed using Mauve (version 2.3.1) (20). The SNP matrix was 
parsed to remove positions with gaps and unknown characters using cus- 
tom Perl scripts. This matrix was used to infer an ML tree using RAxML 
(version 7.2.6) (21). The DNA sequences within each family were aligned 
using Muscle (24), and the alignments were used as an input to construct 
individual gene trees using RAxML (version 7.2.6) (21). As a distinct 
alternative, a majority rule consensus tree of all gene family trees that were 
conserved across all genomes in exactly one copy (2,025 gene families) was 
constructed using Phylip (version 3.69) (25). A concatenated codon align- 
ment was also constructed for the above-mentioned 2,025 genes. All syn- 
onymous positions from the concatenated codon alignment were ex- 
tracted using DnaSP (version 5/10/01) (76). An ML tree was constructed 
from the synonymous SNP matrix using RAxML (version 7.2.6) (21). 
Divergence times for the subspecies were estimated by calibrating the tree 
based on a divergence time of 140 million years for E. coli and Salmonella, 
as estimated by Ochman et al. (23) using MEGA (version 5.1) (77). 

Recombination analysis. Orthologous regions that were conserved 
across all genomes and at least 10 kb in length were extracted from the 
Mauve-based multiple-genome alignments. This resulted in extraction of 
eight fragments totaling 104 kb. This alignment was used as input for 
Clonal Frame (version 1.2) (27). Two independent runs, each consisting 
of 200,000 MCMC iterations, in which the first 100,000 iterations were 
discarded as burn-in, were performed. The runs were compared for con- 
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vergence using the Gelman and Rubin statistic (78), which was found to 
be satisfactory. 

As a second alternative, pairwise alignments of S. enterica serovar Ty- 
phi strain CT18 and all strains of non-subspecies enterica genomes, in- 
cluding S. bongori, were constructed using Mauve (version 2.3.1) (20) 
with S. Typhimurium LT2 as a reference. The pairwise alignments were 
parsed to show orthologous bases for all positions in the S. Typhimurium 
LT2 genome using custom Perl scripts. Regions in any strain where an 
orthologous region in S. Typhimurium LT2 was absent were discarded. 
This pseudo-multigenome alignment was used as input for the Recombi- 
nation Detection Program (RDP) (version 4.16) (30) to detect recombi- 
nation tracts in S. Typhimurium LT2. Nine different algorithms, RDP, 
Geneconv (79), Bootscan (80), Maxchi (81), chimaera (82), SiScan (83), 
PhylPro (84), LARD (85), and 3Seq (86), were used to determine recom- 
bination breakpoints across the alignment. Recombination events de- 
tected within the S. Typhimurium LT2 genome by at least 2 methods were 
annotated as possible recombination events. 
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